The
Teradata database provides several Application Utilities for processing
large numbers of INSERTs, UPDATEs, and DELETEs in a batch environment.
Each utility exploits the capabilities provided by the Teradata parallel
architecture for a specific data maintenance or batch-processing
activity. Teradata application utilities are supported on several
hardware platforms including a wide range of channel-connected
mainframes. Regardless of the host platform, however, all access between
the host and the Teradata database relies on the Call Level Interface
(CLI), a series of callable subroutines that reside in the host's
address space. CLI is responsible for creating and managing the parcels
that travel back and forth between Teradata and the host. It permits the
host to send multiple tasks (sessions) to Teradata at the same time.
CLI is the vehicle that makes parallel access possible.
Teradata has been moving data into and out of the largest data warehouses in the world for close to two decades. While heavy loads have brought other databases to their knees, Teradata has continued to set the standard and break new barriers. The brilliance behind the Teradata load utilities is in their power and flexibility. With six great utilities, Teradata allows you to pick the right utility for the task at hand.
Here is an introduction to the six Teradata utilities, starting with the first three: BTEQ, FastLoad, and MultiLoad. BTEQ was the first Teradata query tool and the first utility. It was built as a report writer, but it also imports and exports data one row at a time.
FastLoad is used to load empty Teradata tables in 64K blocks. It is a mover and shaker that always feels the need for speed. The catch is that while FastLoad is loading, the target table cannot have secondary indexes, join indexes, triggers, or referential integrity. You can, however, use FastLoad to load the table first and then add your secondary indexes, join indexes, triggers, and referential integrity. The only command FastLoad needs to know is INSERT, because it INSERTs into empty Teradata tables by loading 64K blocks of rows (a single block load could carry hundreds to thousands of rows).
MultiLoad is like FastLoad in that it also loads in 64K blocks, so it too is considered a block utility. (BTEQ is not a block utility because it works a row at a time.) Where FastLoad only understands the word INSERT because it only INSERTs into empty tables, MultiLoad is used to populate populated tables. The idea is to use FastLoad to load an empty table the first time, and then use MultiLoad each time you want to add to the table. MultiLoad understands the words INSERT, UPDATE, DELETE, and UPSERT.
FastExport is another block utility that works in 64K blocks, just like FastLoad and MultiLoad, but FastExport only exports data off of Teradata. The only word FastExport understands is SELECT. You SELECT the data from the table, and FastExport exports it off Teradata in 64K blocks.
TPump is one of the most exciting utilities.
It works a row at a time so it is slower than FastLoad or MultiLoad,
but you can have Secondary Indexes, Join Indexes, Referential Integrity
and Triggers on your table while you load it. Why would you
use something slower like TPump when you can rapidly load using
MultiLoad? Because users can continue to query a table while TPump
quietly INSERTS, UPDATES, UPSERTS, or DELETES rows in the background.
Think
of MultiLoad as a noisy train coming down the tracks disrupting
everything in its path and TPump as a quiet truck loading to its
destination.
As
mentioned above, there are efficiencies associated with using large
blocks of data when transferring between computers. So, the logic might
indicate that it is always the best approach. However, there is never
one best approach.
You
will learn that efficiency comes at the price of other database
capabilities. For instance, when using large blocks to transfer and
incorporate data into Teradata the following are not allowed:
1. Secondary indexes
2. Triggers
3. Referential integrity
4. More than 15 concurrent utilities running at the same time
Therefore, it is important to understand when and where these considerations are present.
Before Teradata V2R6.0, when Tera-Tom was just a baby in the crib, the DBS Control parameter MaxLoadTasks had a maximum limit of 15. This meant that no more than 15 block utilities (FastLoad, MultiLoad, and FastExport combined) could run simultaneously. Many companies set this to 5, because these block utilities have a major impact on a system and can greatly affect user query performance.
After Teradata V2R6.0, Teradata increased and changed this number, and it no longer includes FastExport. Let me explain. Now there can be up to 30 concurrent FastLoad and MultiLoad jobs, but remember that it is up to each individual company to determine whether this is too many because of the performance hit. Up to 60 FastExport jobs can run concurrently; the only caveat is that the 60 FastExports share the limit with the block loaders (60 minus the number of active FastLoad and MultiLoad jobs also running). This new feature is controlled by a new DBS Control parameter named MaxLoadAWT, which controls AMP Worker Tasks (AWTs). When MaxLoadAWT is set to zero, it is like going back in time to pre-V2R6.0, where only 15 FastLoad, MultiLoad, and FastExport jobs can run at most.
When MaxLoadAWT is greater than zero, the new feature is active. Each AMP can perform 80 tasks at once, meaning there are 80 AMP Worker Tasks per AMP. MaxLoadAWT should never exceed 48, or the AMPs would not be able to do much else during the load.
Block-level utilities have speed, but also many restrictions. The opposite of sending a large block of rows at the same time is sending a single row at a time. The primary difference between these approaches is speed: it is always faster to send multiple rows in one operation than one row at a time.
If
it is slower, why would anyone ever use this approach? The reason is
that it provides more flexibility with fewer considerations. By this, we
mean that the row at a time utilities allow the following:
1. Secondary indexes
2. Triggers
3. Referential integrity
4. More than 15 concurrent utilities running at the same time
As
you can see, they allow all the things that the block utilities do not.
With that in mind and for more information, continue reading about the
individual utilities and open up a new world of capabilities in working
with the Teradata RDBMS.
Fast Path Inserts inside the Teradata Database
The
load utilities such as BTEQ, FastLoad, MultiLoad, TPump and TPT are
designed to import or export data to and from Teradata, but it is also
important that you understand that once you are inside Teradata you can
use an INSERT SELECT from one Teradata table to another and get great
speed.
If the target table you are loading starts out empty, there isn't a large amount of writing to the Transient Journal. The Transient Journal is designed to roll back bad transactions, but since the table starts empty there is only one entry written to the Transient Journal, and then it sits idle. If the Transient Journal needs to roll back the table, it simply empties it back to the way it started.
For
the Fast Path to be taken both the target and the source table must
have the same Primary Index. That way no data has to be moved across the
AMPs via the BYNET and Teradata can just copy and insert the blocks
directly. This is why it is called the Fast Path.
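As a sketch (the table names here are hypothetical), a Fast Path INSERT/SELECT looks like ordinary SQL; the speed comes from the empty target and the matching Primary Index:

```
/* Assumes Customer_History starts empty and has the same Primary Index
   as Customer_Table, so blocks are copied directly on each AMP and the
   Transient Journal stays idle after one entry. */
INSERT INTO SQL_Class.Customer_History
SELECT * FROM SQL_Class.Customer_Table;
```

If the target is not empty or the Primary Indexes differ, the same statement still works, but rows are inserted one at a time with full journaling.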
BTEQ
Why
is BTEQ available on every Teradata system ever built? Because the
Batch TEradata Query (BTEQ) tool was the original way that SQL was
submitted to Teradata as a means of getting an answer set in a desired
format. Here is what is excellent about BTEQ:
- BTEQ can be used to submit SQL in either a batch or interactive environment. Interactive users can submit SQL and receive an answer set on the screen. Users can also submit BTEQ jobs from batch scripts, have error checking and conditional logic, and allow for the work to be done in the background.
- BTEQ outputs a report format, whereas SQL Assistant outputs data in a format more like a spreadsheet. This gives BTEQ a great deal of flexibility in formatting data, creating headings, and utilizing Teradata extensions, such as WITH and WITH...BY, which SQL Assistant has problems handling.
- BTEQ is often used to submit SQL, but is also an excellent tool for importing and exporting data.
Importing Data:
Data can be read from a file on either a mainframe or LAN attached
computer and used for substitution directly into any Teradata SQL using
the INSERT, UPDATE or DELETE statements.
Exporting Data:
Data can be written to either a mainframe or LAN attached computer
using a SELECT from Teradata. You can also pick the format you desire
ranging from data files to printed reports to spread sheet formats.
Logging into BTEQ
When
logging onto BTEQ in interactive mode you will type .LOGON followed by
your TDP-ID. The TDP-ID identifies your system. You could have multiple
Teradata systems, such as a production system and a test system. The TDP
is the name of the system you are logging onto.
Then
you enter your User-ID. You will then be prompted for your password,
which is hidden from view as you type it in. Remember that BTEQ commands
begin with a period (.) and do not require a semi-colon (;) to end the
statement. SQL commands do not ever start with a period and they must
always be terminated with a semi-colon.
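As a sketch (the TDP-ID ProdTD and User-ID SQL_User are hypothetical), an interactive session looks like this; BTEQ prompts for the password and hides it as you type:

```
.LOGON ProdTD/SQL_User
Password: ********

SELECT * FROM SQL_Class.Employee_Table;

.LOGOFF
.QUIT
```

Notice the pattern described above: the BTEQ commands start with a period and need no semi-colon, while the SQL statement has no period and must end with one.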
Using BTEQ to submit queries in Interactive Mode
Once
you logon to Teradata through BTEQ, you are ready to run your queries.
Teradata knows the SQL is finished when it finds a semi-colon, so don’t
forget to put one at the end of your query.
Why
do you need to place a semi-colon behind all SQL? That is how Teradata
knows the BTEQ query is finished. You might write a very long query and
have to hit ENTER many times. But once you hit ENTER and Teradata sees a
semicolon it is like hitting EXECUTE on SQL Assistant or the Nexus
Query Chameleon.
The first query merely makes SQL_Class the default database. The second query selects from the Employee_Table. Then you see your result set. Take a look at the SQL in the following figure. It does a simple SELECT * FROM Employee_Table, but also notice the WITH SUM(Salary) BY Dept_No.
This
is the WITH BY statement that will allow for detail rows being reported
with Sub-Totals on Dept_No breaks. The WITH BY Statement only works in
BTEQ. Notice that each detail line shows information about a particular
individual in a particular department number. Then when there is a new
department number the report will break and show the SUM (Salary) for
that particular department. This is an example of the WITH BY Statement
showing subtotals.
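Since the figure itself is not reproduced here, a sketch of such a query (column names follow the examples in this chapter) would be:

```
SELECT  Last_Name,
        Dept_No,
        Salary
FROM    Employee_Table
WITH    SUM(Salary) BY Dept_No;
```

Each detail row prints as usual, and at every change in Dept_No a subtotal line for SUM(Salary) appears. Remember, WITH...BY only works in BTEQ.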
The Four Types of BTEQ Exports
There
are four types of BTEQ Exports. BTEQ allows for multiple techniques to
export data. We usually think of an export as moving data off of
Teradata to a normal flat file.That is example number one and that is
called RECORD Mode.
Sometimes there are NULLs in your data, and if you export them to a mainframe, the mainframe application could run into problems interpreting them. That is why INDICDATA is designed to place indicator bits at the front of the records to flag NULL values.
BTEQ can actually take your SQL output report, headers included, and export it all together so it looks like an electronic report. That is EXPORT REPORT mode, also called Field Mode. Reports are truncated to 254 characters for mainframes and 75 characters for network-attached computers. You can set your report up with REPORTWIDE, which effectively sets the width to 32,765 characters (not supported in all releases). The .SET WIDTH command in BTEQ can be used to set the width to any value from 20 to 65,531.
The
last mode is DIF and this is used when you want certain flat files to
be able to be used by PC applications that utilize the Data Interchange
Format.
A BTEQ Export Script
Exporting a Data File
Exporting a Report File
The following pictures show how to export data to a normal flat file and to a report flat file. The first statement is the LOGON statement; the last statement is the LOGOFF statement. You basically tell BTEQ you are going to export a file. Then you run a SELECT query and BTEQ exports it!
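Since the pictures are not reproduced here, a minimal sketch of the two scripts (logon credentials and file paths are hypothetical) might look like:

```
.LOGON ProdTD/SQL_User,mypassword

/* RECORD mode: raw data, unreadable to the eye */
.EXPORT DATA FILE = C:\Temp\CustData.txt
SELECT * FROM SQL_Class.Customer_Table;
.EXPORT RESET

/* REPORT mode: formatted like an electronic report */
.EXPORT REPORT FILE = C:\Temp\CustReport.txt
SELECT * FROM SQL_Class.Customer_Table;
.EXPORT RESET

.LOGOFF
.QUIT
```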
The data exported to a normal flat file looks like garbage. This is actually perfect; that is how it should look. Notice the report file in the bottom example. It does look perfect, and it is also perfect. This is the difference between our RECORD mode and our REPORT mode.
Output for Record Mode and Report Mode
Report Mode to Export a CSV File
You can use Report Mode to create a Comma Separated Value (CSV) file. This is accomplished with the pipe symbol (just above the ENTER key on your keyboard), or rather two pipe symbols: ||, which means concatenate. If the records are greater than 75 characters (the default width for Report Mode), use the .SET WIDTH command to lengthen it.
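A sketch of such a CSV export (column names and the output path are hypothetical) might look like:

```
.LOGON ProdTD/SQL_User,mypassword
.SET WIDTH 200

.EXPORT REPORT FILE = C:\Temp\Customer.csv
/* || concatenates; the literal commas become the CSV separators */
SELECT  TRIM(CAST(Customer_Number AS VARCHAR(11))) || ',' ||
        TRIM(Customer_Name)                        || ',' ||
        TRIM(Phone_Number)
FROM    SQL_Class.Customer_Table;
.EXPORT RESET

.LOGOFF
.QUIT
```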
Importing from a CSV File
Two Ways to run a BTEQ Script
- COMMAND PROMPT: BTEQ < C:\HOME\PATH\SCRIPT NAME
- Inside BTEQ: .RUN FILE = C:\HOME\PATH\SCRIPT NAME
Exporting large files in a UNIX MP-RAS System
Many a large export has been aborted when using UNIX MP-RAS and BTEQ, because UNIX MP-RAS has always had a 2GB file-size limit. This is taken care of with 64-bit operating systems, but I have a solution for you if you are still using UNIX MP-RAS: just use the keyword AXSMOD in your BTEQ export DATA FILE command. You don't need it unless you are using UNIX MP-RAS and exporting a file larger than 2GB.
BTEQ Import Scripts
The
following figure shows an example of a BTEQ IMPORT Script. We are
taking data from our flat file called C:\Temp\CustData.txt and importing
the records into the SQL_Class.Customer_Table.
A
CSV file is a Comma Separated Value flat file. Each column is separated
by a comma. We have taken our comma separated value flat file called
C:\Temp\var1.txt and we are importing that to our
SQL_Class.Customer_Table. Notice how each column is defined using VARCHAR. This is necessary when using CSV files.
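Since the figures are not reproduced here, a sketch of the record-mode import (logon credentials and column names are hypothetical) might look like:

```
.LOGON ProdTD/SQL_User,mypassword

.IMPORT DATA FILE = C:\Temp\CustData.txt
.QUIET ON
.REPEAT *
USING (Cust_No    INTEGER,
       Cust_Name  CHAR(20),
       Phone      CHAR(10))
INSERT INTO SQL_Class.Customer_Table
VALUES (:Cust_No, :Cust_Name, :Phone);

.LOGOFF
.QUIT
```

For the CSV version, the import line becomes .IMPORT VARTEXT ',' FILE = C:\Temp\var1.txt, and every column in the USING clause must be defined as VARCHAR.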
Multiple Sessions in BTEQ
Teradata
allows you to set multiple sessions in a BTEQ script. However, this
will only work if your SQL is using the Primary Index or a Unique
Secondary Index so UPI, NUPI, and USI are the only options that will
utilize multiple sessions.
The SESSIONS parameter merely tells BTEQ how many times to log on to Teradata in order to take advantage of multiple sessions. You can utilize up to 200 sessions on all types of operating systems as long as you are using at least Teradata V2R5. Typically, BTEQ IMPORTs will take advantage of multiple sessions, but BTEQ EXPORTs will not.
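As a sketch, the .SET SESSIONS command goes before the .LOGON so the extra sessions are established at logon time (names and paths are hypothetical):

```
.SET SESSIONS 8
.LOGON ProdTD/SQL_User,mypassword

/* A Primary Index INSERT, so the work can spread across the 8 sessions */
.IMPORT DATA FILE = C:\Temp\CustData.txt
.REPEAT *
USING (Cust_No INTEGER)
INSERT INTO SQL_Class.Customer_Table (Customer_Number)
VALUES (:Cust_No);

.LOGOFF
.QUIT
```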
TPump
TPump
is the shortened name for the load utility Teradata Parallel Data Pump.
To understand this, you must know how the load utilities move the data.
Both FastLoad and MultiLoad assemble massive volumes of data rows into
64K blocks and then move those blocks. Picture in your mind the way
that huge ice blocks used to be floated down long rivers to large cities
prior to the advent of refrigeration. There they were cut up and
distributed to the people. TPump does NOT move data in the large blocks.
Instead, it loads data one row at a time, using row hash locks. Because
it locks at this level, and not at the table level like MultiLoad,
TPump can make many simultaneous, or concurrent, updates on a table.
Envision
TPump as the water pump on a well; pumping in a very slow, gentle
manner resulting in a steady trickle of water that could be pumped into a
cup. But strong and steady pumping results in a powerful stream of
water that would require a larger container. TPump is a data pump which,
like the water pump, may allow either a trickle-feed of data to flow
into the warehouse or a strong and steady stream. In essence, you may
"throttle" the flow of data based upon your system and business user
requirements.
A Sample TPump Script
The script on the following page follows these steps:
1. Setting up a Logtable
2. Logging onto Teradata
3. Identifying the Target, Work and Error tables
4. Defining the INPUT flat file
5. Defining the DML activities to occur
6. Naming the IMPORT file
7. Telling TPump to use a particular LAYOUT
8. Telling the system to start loading
9. Finishing and logging off of Teradata
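The steps above can be sketched as a minimal TPump script (table names, file path, and credentials are hypothetical, and exact option order may vary by release):

```
.LOGTABLE SQL_Class.Cust_Log;
.LOGON ProdTD/SQL_User,mypassword;

.BEGIN LOAD
       ERRORTABLE SQL_Class.Cust_Err
       ERRLIMIT   5
       PACK       20
       RATE       1000
       SESSIONS   4;

.LAYOUT Cust_Layout;
.FIELD Cust_No    * VARCHAR(11);
.FIELD Cust_Name  * VARCHAR(20);

.DML LABEL Ins_Cust;
INSERT INTO SQL_Class.Customer_Table
VALUES (:Cust_No, :Cust_Name);

.IMPORT INFILE C:\Temp\CustData.txt
        FORMAT VARTEXT ','
        LAYOUT Cust_Layout
        APPLY  Ins_Cust;

.END LOAD;
.LOGOFF;
```

The PACK and RATE settings are the "throttle" described above: they control how many statements travel per request and how many statements are sent per minute.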
Teradata Parallel Transport (TPT)
The
Teradata Parallel Transport (TPT) utility combines BTEQ, FastLoad,
MultiLoad, TPump, and FastExport utilities into one comprehensive
language utility. This allows TPT to insert data to tables, export data
from tables, and update tables.
TPT can simultaneously load data from multiple sources in a single job. It can also execute multiple instances of an operator. It can even export, transform and load one or more tables in a single job. All of this with the ability to perform inline filtering while also being able to transform data means that this single scripting feature adds enormous flexibility and power to the stand-alone utilities.
TPT's major capability is being able to read, transform, and load data in a pipeline: as data is placed into a data buffer, the next operator can start performing its task without waiting for the prior process to complete. This works because TPT places data into streams, which means the data isn't written to disk between steps.
TPT DATA STREAM FLOW
TPT works around the concepts of Operators and Data Streams. One Operator reads the Source data and passes its contents to a data stream, where another Operator is responsible for taking the data stream and loading it to disk. Notice on the following page that we have a flat file as our Source. A Producer Operator, designed to read input, moves the data to a Data Stream. The Consumer Operator, designed to write data to a Teradata table, then loads the data.
The data source or destination can be:
- Database (relational or non-relational)
- Server
- File
- Data Storage device
- Object (images, voice, pictures, text, etc.)
TPT can also perform a wide variety of processes while data is being streamed from source to destination, such as:
- Retrieve, store, and transport objects via parallel data streams
- Merge or split multiple parallel data streams
- Filter, massage, and cleanse data
TPT Operators
There are four types of TPT Operators.
- Producer
- Consumer
- Filter
- Standalone
Producer :- A Producer operator gets data from Teradata or from external sources such as Oracle and writes it to a stream; the idea is to make the data available to other operators.
Consumer :- A Consumer operator takes data from the stream and loads it into Teradata or another external data source.
Filter :- A Filter operator consumes data from a stream, transforms it in some way, and sends it further down the stream. Filter operators perform data selection, validation, cleansing, and condensing.
Standalone :- A Standalone operator is mainly used for DDL commands. Standalone operators are also used as Update operators in a DELETE task.
Read From Multiple Source Files Simultaneously
TPT can take multiple sources, use multiple Producer Operators to read each source, and then use multiple Consumer Operators to write the data from those sources into one Teradata table.
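A minimal TPT job tying a Producer and a Consumer together through a data stream might be sketched like this (schema, operator names, credentials, and file path are all hypothetical):

```
DEFINE JOB Load_Customer
DESCRIPTION 'Sketch: flat file -> data stream -> Teradata table'
(
  DEFINE SCHEMA Cust_Schema
  (
    Cust_No   VARCHAR(11),
    Cust_Name VARCHAR(20)
  );

  /* Producer: reads the delimited flat file into the data stream */
  DEFINE OPERATOR File_Reader
  TYPE DATACONNECTOR PRODUCER
  SCHEMA Cust_Schema
  ATTRIBUTES
  (
    VARCHAR DirectoryPath = 'C:\Temp\',
    VARCHAR FileName      = 'CustData.txt',
    VARCHAR Format        = 'Delimited',
    VARCHAR TextDelimiter = ','
  );

  /* Consumer: takes the stream and loads the Teradata table */
  DEFINE OPERATOR Load_Writer
  TYPE LOAD
  SCHEMA *
  ATTRIBUTES
  (
    VARCHAR TdpId        = 'ProdTD',
    VARCHAR UserName     = 'SQL_User',
    VARCHAR UserPassword = 'mypassword',
    VARCHAR TargetTable  = 'SQL_Class.Customer_Table'
  );

  APPLY ('INSERT INTO SQL_Class.Customer_Table VALUES (:Cust_No, :Cust_Name);')
  TO OPERATOR (Load_Writer)
  SELECT * FROM OPERATOR (File_Reader);
);
```

To read multiple source files at once, you would run multiple instances of the Producer (for example, FROM OPERATOR (File_Reader[2])), all feeding the same Consumer.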